Nodes in real-world networks are usually organized in local modules. These groups, called communities, are intuitively defined as sub-graphs with a larger density of internal connections than of external links. In this work, we introduce a new measure aimed at quantifying the statistical significance of single communities. Extreme and Order Statistics are used to predict the statistics associated with individual clusters in random graphs. These distributions allow us to define the significance of a community as the probability that a generic clustering algorithm finds such a group in a random graph. The method is successfully applied to real-world networks to evaluate the significance of their communities.

PACS numbers: 89.75.Fb,89.75.-k,89.70.Cf 



I. INTRODUCTION 

Complex networks play a crucial role in understanding physical, biological, social and technological systems [1]-[3]. Interactions between proteins in cells of living organisms, relations between human actors in socioeconomic contexts and connections between Web pages in the World Wide Web can naturally be described as graphs. Real-world networks typically have complex topological properties but, in spite of their evident diversity, structural analysis has revealed that they share a conspicuous set of common features: scale-freeness (i.e., the number of connections per node follows a wide or power-law distribution) [1] and small-worldness (i.e., the average number of hops between two nodes in the network scales logarithmically with its size) [4] are two celebrated examples of such properties. Recent studies have focused on deeper structural features of networks. Real-world networks are typically organized in local clusters of nodes, usually called communities. Communities are groups of nodes with a higher level of interconnection among themselves than with the rest of the graph. In this sense, communities are groups relatively isolated from the other nodes of the network and are expected to represent elements sharing common features and/or playing similar roles within the system (see Ref. [5] for an exhaustive review). For instance, in the World Wide Web, communities are composed of groups of Web pages dealing with similar topics; in social networks, communities stand for sets of actors sharing common interests, ideas and friendship relationships; in protein interaction networks, communities represent groups of proteins with similar functionalities.

This imbalance of in- and out-connections corresponds to an intuitive concept, and several formalizations of the definition of community exist. The LS set [6] or strong community [8] stands for a group where every node belonging to the group has more internal connections than external ones. A less restrictive definition refers to a weak community [8] as a set of nodes where the number of intracommunity connections (summed over all nodes within the group) is larger than the number of links going out of the community. Along these lines, the well-known modularity is a quality function able to quantify the statistical importance of a partition by comparing the number of internal connections observed in the communities with the number expected in a suitable null model [5]. Besides the formulation of a definition, considerable effort has been devoted to the detection of communities in networks. Since the total number of possible divisions of a network into subgraphs is a non-polynomial function of the size of the network itself, detecting communities is not a trivial issue. Many algorithms have been proposed in recent years, each of them in the same spirit of finding the groups that maximize the internal density of links [9]-[24]. Different principles may be used, but in all cases some property related to the community structure is locally or globally optimized. The consequence is that even in uncorrelated networks these algorithms find clusters that are supposed to be good according to the modularity function or to other quality measures.

If algorithms are able to identify communities even in random graphs, what value can we give to communities found in real networks? Or better, how can the significance of a community be determined statistically? This problem has been the subject of some studies in the literature [20], [25]-[29]. In [20], [27], for example, the partition of a network maximizing the modularity is compared with the maximum modularity partition of a randomized version of the given network (i.e., all edges are randomly rewired). In [29], differently, the importance of a community partition is proportional to its robustness against random perturbations (i.e., random reshuffling of edges). Such heuristic approaches rely on the modularity function to evaluate the quality of a partition, which means that they are subject to the modularity resolution


Figure 1: (Color online) Sketch of the theoretical framework referring to the null-model. Node $i$ has $k_i$ free ends to allocate. Each of them can connect to nodes within $C$ or to vertices belonging to the rest of the network.



limits [21], [31]. Furthermore, all the proposed methods are designed to deal with full partitions, not with single communities, even though a network may contain some meaningful communities alongside randomly connected node clusters. In this paper, we develop a statistical method aimed at discriminating between a single bona fide community and structures arising as topological fluctuations. Instead of a direct comparison with an average outcome, the community is compared with the best expected result for a null-model. The reason for stressing this "best outcome" is that community detection algorithms will in general produce the best possible clusters given a graph, even if it is random. The threshold of significance can be approximated by using Extreme and Order Statistics [33] [33] applied to null-model community fitness. The significance of a community can then be obtained as the extreme probability of finding a group equal to or better than the given one in a set of equivalent random graphs.



II. NULL-MODELS AND DEFINITION OF 
C-SCORE 

Consider a scenario like the one depicted in Fig. 1, with a given community $C$ in a graph. $k_i$ denotes the number of connections (degree) of node $i$. Given $C$, $k_i$ can be divided into two terms: $k_i^{int}$, the number of links connecting $i$ to nodes in $C$, and $k_i^{ext}$, the number of connections outside. Similarly, we define the internal degree of $C$, $m_C^{int} = \sum_{i \in C} k_i^{int}$, as well as $m_C^{ext} = \sum_{i \in C} k_i^{ext}$ and its total degree $m_C = m_C^{int} + m_C^{ext}$. We consider a very simple stochastic null model: all connections inside the group are locked (the community is given, so it cannot be altered), while the other links are randomly reshuffled among all nodes preserving their degrees. For simplicity, we allow the rewiring operation to form multiple links (two nodes can be connected by more than one edge) or self-loops. In some weighted graphs the weights of the links are equivalent to multiple connections, so the present null-model is appropriate. Some examples are social networks (the Zachary club [31], see last section) or the C. elegans metabolic network [16], which will be analyzed later. For unweighted graphs, we have checked that our results do not noticeably change whether or not multiple links are included, as long as the graph is not condensed (condensed meaning that a node gets a finite fraction of the links). When node degrees are much smaller than the network size, the probability of generating self-loops and multiple links by random reshuffling becomes negligible. Note also that our null model is similar to the one used for the definition of the modularity [9] and close in spirit to the configurational model [35]. It generates graphs that have no special internal structure except that given by random fluctuations, keep the degree sequence of the original network and can show degree-degree correlations only if the degree sequence and the network size determine their presence [36]. This is the most general null model, appropriate when no knowledge about the system is available and simple enough to be treated analytically. If further information regarding the constraints present in the process that generated the given network is available, other null models, simpler or more elaborate, can be employed. Our method to evaluate group significance is general enough to admit different null models, provided the distributions described next are altered accordingly.

Once the null model has been selected, suppose that $C$ is a group composed of randomly chosen nodes and consider a generic node $i$ not belonging to $C$. The distribution of $k_i^{int}$ is given by the hypergeometric distribution

$$f(k_i^{int} \mid C) = \frac{\binom{m_C^{ext}}{k_i^{int}} \binom{m^* - m_C^{ext}}{k_i - k_i^{int}}}{\binom{m^*}{k_i}}, \qquad (1)$$

where $\binom{x}{y} = \frac{x!}{y!\,(x-y)!}$ is a binomial coefficient, and $m^*$ are the free ends in the network: $m^* = m - m_C^{int}$ ($m$ are the total ends in the graph, twice the number of links). Eq. (1) states that the probability for node $i$ to have $k_i^{int}$ internal connections to $C$ is given by the ratio of two terms: the number of ways in which $k_i^{int}$ links can be placed on the $m_C^{ext}$ free ends of $C$, multiplied by the number of ways to place the remaining $k_i - k_i^{int}$ edges on the other $m^* - m_C^{ext}$ free ends, divided by the total number of ways to place all $k_i$ connections in the network (i.e., out of $m^*$ free ends). If the node $i$ belongs to $C$, Eq. (1) has to be corrected to exclude $i$ from the group. When the group $C$ is composed of $n_C$ randomly chosen nodes, Eq. (1) recovers the results obtained via numerical simulations (see inset of Fig. 2).
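As an illustration of Eq. (1), the following minimal Python sketch evaluates the hypergeometric probability using only the standard library. The function name `f_internal`, its argument names and the numerical values are assumptions made for this example; they are not taken from the simulations discussed in the text.

```python
from math import comb

def f_internal(k_int, k_i, m_ext_C, m_star):
    """Eq. (1): probability that a node of degree k_i places k_int of its
    ends on the m_ext_C free ends of C, out of m_star free ends in total."""
    if k_int < 0 or k_int > k_i:
        return 0.0
    ways_inside = comb(m_ext_C, k_int)                   # ends landing on C
    ways_outside = comb(m_star - m_ext_C, k_i - k_int)   # remaining ends
    return ways_inside * ways_outside / comb(m_star, k_i)

# Illustrative (assumed) values: a node of degree 15, 200 free ends on C,
# 1400 free ends in the whole network.
k_i, m_ext_C, m_star = 15, 200, 1400
pmf = [f_internal(q, k_i, m_ext_C, m_star) for q in range(k_i + 1)]
total = sum(pmf)                                # normalization check
mean = sum(q * p for q, p in enumerate(pmf))    # expected internal degree
```

The probabilities sum to one, and the mean equals $k_i\, m_C^{ext} / m^*$, the standard mean of a hypergeometric distribution.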

Figure 2: (Color online) Distributions $f(k^{int})$ and $g(k^{int})$ for randomly generated networks of homogeneous degree. The (black) circles are numerical results with $N = 100$, $n_C = 20$ and $k = 15$. The distributions refer to external nodes of groups selected at random (inset) or by maximizing modularity (main plot). Dashed (red) curves are the approximation given by Eq. (1) and the continuous (blue) one that of Eq. (2).

The next, more interesting, case is when $C$ is not composed of randomly chosen nodes but has been detected by a clustering algorithm. As can be seen in the main plot of Figure 2, the shape of $f(k^{int})$ changes dramatically due to the node selection made by the algorithm. Most of the nodes populating the tail of the distribution are incorporated into the group. Correlations are also present, since nodes in the community are expected to be connected among themselves. Still, it is possible to obtain an approximate expression for the probability $f(k^{int})$. We first consider the case of homogeneous graphs where all nodes have the same degree (i.e., $k_i = k$, $\forall i$) and later extend our analysis to networks with arbitrary degree sequences. We will assume that $C$ has been selected to maximize $k_i^{int}$ for each node inside $C$ as well as the overall $m_C^{int}$. This also implies that, since the nodes are all equivalent (they have the same degree $k$), they can be ranked according to their $k^{int}$. We indicate with $w$ the node (or nodes) with the lowest $k^{int}$ within the community (see Fig. 1). $k_w^{int}$, the internal degree of the worst node, then establishes an upper cut-off to the possible values of $k^{int}$ of the out-group nodes. An expression similar to Eq. (1) can then be derived for the external nodes by taking into account this new cut-off:

$$g(k_i^{int} \mid C, k_w^{int}) \propto \binom{m_C^{ext}}{k_i^{int}} \binom{\tilde{m} - m_C^{ext}}{k_i - k_i^{int}}, \qquad k_i^{int} \le k_w^{int}, \qquad (2)$$



where $\tilde{m} = (N - n_C)$. The term $\tilde{m}$ accounts for the fact that no node can connect to more than $k_w^{int}$ internal vertices and therefore some of the free ends $m^*$ become occupied. Eqs. (1) and (2) specify the null-model. Our method does not depend on the particular functional shapes of $f(k_i^{int})$ and $g(k_i^{int})$. For instance, a more restricted null-model without multiple links can be approximated by using Wallenius' noncentral hypergeometric distribution, although this considerably complicates the numerical evaluation of the functions. Another null-model, less realistic but very easy to implement, is that of Erdős-Rényi-like networks, for which $f(k_i^{int})$ and $g(k_i^{int})$ are binomial distributions.

Figure 3: (Color online) Probability distribution $P(k_w^{int})$ for the internal degree of the worst node of the community, calculated for groups $C$ detected by maximizing the modularity in randomly generated networks of $N = 100$ (left) and $N = 300$ (right) with homogeneous degree $k = 15$ and for groups of $n_C = 20$. The black circles are the results of numerical simulations, while the continuous blue curves correspond to the theoretical predictions derived from Eq. (3). The extreme value distribution of $f(k^{int})$ from Eq. (1) is also plotted for comparison (violet dashed curve).

The worst node within the community, $w$, plays a central role in our method to evaluate group significance. We assume that in a random graph there is no drastic variation between $k_w^{int}$ and the internal degree $k^{int}$ of the best nodes outside the group. Postulating a smooth variation of $k^{int}$ between the inside and the outside of the community allows us to find an expression for the probability distribution of $k_w^{int}$ based on Eq. (2), which only applies to external nodes. The degree of the worst node, $k_w^{int}$, is a given quantity in $g(k^{int} \mid C, k_w^{int})$. In order to find a formula for $P(k_w^{int})$, we thus need to shift our point of reference and consider the second worst node within the community, $w'$. If the statistics of $k_w^{int}$ is comparable to that of the best external nodes, $k_w^{int}$ should follow the distribution of the extreme of $g(k_i^{int} \mid C \setminus \{w\}, k_{w'}^{int})$. This means that the probability for $k_w^{int}$ to be lower than or equal to a certain number reads

$$\Pr(\le k_w^{int}) = \left[ G(k_w^{int} \mid C \setminus \{w\}, k_{w'}^{int}) \right]^{N - n_C + 1}, \qquad (3)$$

where $G(\cdot)$ is the cumulative of the function $g(\cdot)$. The distribution $P(k_w^{int})$ is given by the derivative of the cumulative of Eq. (3), $P(k_w^{int}) = \partial \Pr(\le k_w^{int})$. It must be remarked that Eq. (3) is valid for independent random variables; in our null model the independence is justified for external nodes and is an approximation when it refers to $w$. Figure 3 shows a comparison between the distribution $P(k_w^{int})$ obtained with this procedure and its counterpart from numerical simulations. Despite the approximations made to reach an analytical form for $P(k_w^{int})$, the agreement is remarkable. The use of Extreme Statistics contributes in part to such agreement, since under very general conditions the limit extreme-value distribution is stable and has no memory of the parent distribution.

Once a functional form for $\Pr(\le k_w^{int})$ has been obtained, we can define a measure of the significance of a group, the C-score, as

$$c = \Pr(\ge k_w^{int}) = 1 - \Pr(\le (k_w^{int} - 1)), \qquad (4)$$

which corresponds to the probability that $k_w^{int}$ for an optimized community in an equivalent random graph ensemble is higher than or equal to the value seen in $C$. A point to stress here is that $c$ contains not only information about the worst node, through $k_w^{int}$, but also about the community external links and about the degrees of the external nodes.
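Eqs. (3) and (4) can be combined into a short numerical sketch for the homogeneous case. The code below is only an illustration under stated assumptions: the cut-off distribution $g$ is approximated by Eq. (1) truncated at $k_{w'}^{int}$ and renormalized (using $m^*$ in place of $\tilde{m}$), and the function names and input values are hypothetical.

```python
from math import comb

def hypergeom_pmf(q, k, m_ext, m_free):
    """Eq. (1): probability that q of the k ends land on the m_ext ends of C."""
    return comb(m_ext, q) * comb(m_free - m_ext, k - q) / comb(m_free, k)

def c_score(k_w, k_w2, k, m_ext, m_free, n_nodes, n_c):
    """C-score of Eq. (4) for homogeneous degree k (requires k_w2 <= k).

    k_w  : internal degree of the worst node w
    k_w2 : internal degree of the second worst node w' (the cut-off)
    Assumption: g is Eq. (1) truncated at k_w2 and renormalized.
    """
    weights = [hypergeom_pmf(q, k, m_ext, m_free) for q in range(k_w2 + 1)]
    norm = sum(weights)

    def cdf(x):
        # Cumulative G(x) of the truncated distribution.
        if x < 0:
            return 0.0
        return sum(weights[: min(x, k_w2) + 1]) / norm

    n_draws = n_nodes - n_c + 1          # external nodes plus w, as in Eq. (3)
    return 1.0 - cdf(k_w - 1) ** n_draws

# Illustrative (assumed) inputs: N = 100, n_C = 20, k = 15.
c_example = c_score(k_w=6, k_w2=7, k=15, m_ext=140, m_free=1340,
                    n_nodes=100, n_c=20)
```

By construction the score decreases as the internal degree of the worst node grows, so a community whose worst member is still strongly attached obtains a low C-score.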

In order to extend our results to heterogeneous graphs, we need to rank the nodes according to the role they play with respect to the given community $C$. For regular networks, since all the nodes are equivalent, the ranking can be simply established by considering the values of the internal degrees $k^{int}$. However, another criterion is required to deal with heterogeneous networks. We use the probability distribution provided by Eq. (1) as the basis for such a procedure. The rank of a node $i$ can be established by the probability of finding a node with an internal degree $k_i^{int}$ or higher in the null model, given its degree $k_i$ and $C$. That is, for each node $i$ we calculate the score $r_i = \sum_{q = k_i^{int}}^{k_i} f(q)$ and then perform comparisons on the basis of $r$. The values of $r$ fall in the interval $[0, 1]$ regardless of the node degree, which facilitates the comparison. $w$ and $w'$ thus correspond to the nodes with the highest and second highest values of $r$ within the community, respectively. Under the hypothesis of a randomly connected network, the scores $r$ of the vertex $w$, $r_w$, and those of the external nodes can be seen as random variables uniformly distributed in the interval $[r_{w'}, 1]$. The C-score can then be calculated as the probability of observing $r_w$ as the minimal value of a set of $(N - n_C + 1)$ random extractions from a uniform distribution defined in the interval $[r_{w'}, 1]$. An alternative to this last step is to map the internal degree of $w'$ into $k_{w'}^{int}$ (the internal degree that it would have if its degree were equal to $k_w$ and its score $r_{w'}$) by inverting the distribution of Eq. (1). Once the transformation has been performed, we can proceed in the same way as for homogeneous networks with Eqs. (3) and (4).
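The score $r$ and the uniform-distribution step described above can be sketched as follows. This is one possible reading of the text, not the authors' implementation: the C-score is taken as the probability that the minimum of $N - n_C + 1$ independent uniform draws on $[r_{w'}, 1]$ does not exceed $r_w$, which gives $1 - [(1 - r_w)/(1 - r_{w'})]^{N - n_C + 1}$. Function and argument names are hypothetical.

```python
from math import comb

def r_score(k_int, k, m_ext, m_free):
    """Upper tail of Eq. (1): probability of an internal degree >= k_int
    for a node of degree k in the null model."""
    tail = sum(comb(m_ext, q) * comb(m_free - m_ext, k - q)
               for q in range(k_int, k + 1))
    return tail / comb(m_free, k)

def c_score_from_r(r_w, r_w2, n_nodes, n_c):
    """Probability that the minimum of (N - n_C + 1) uniform draws on
    [r_w2, 1] is <= r_w, where r_w2 is the score of w'."""
    n_draws = n_nodes - n_c + 1
    if r_w2 >= 1.0:
        return 1.0                       # degenerate interval: all draws equal 1
    survive = (1.0 - r_w) / (1.0 - r_w2)  # chance that one draw exceeds r_w
    survive = min(1.0, max(0.0, survive))
    return 1.0 - survive ** n_draws
```

Because $r$ lies in $[0, 1]$ for any degree, nodes of very different degree can be compared on the same footing, which is the purpose of the ranking introduced in the text.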

III. BEYOND THE C-SCORE 

A low value of the C-score (i.e., $c < 5\%$) is enough to consider a group as significant. However, when the C-score is higher, one could argue that relying only on the worst node of the community for the evaluation of the full group is too severe a criterion. Algorithms may fail to place a single node, and this would translate into a non-significant community according to the C-score approach. The performance of the method can be improved by a further refinement. Instead of considering only the last node, one can include a longer list of nodes and use this information for the computation of the statistical significance of the community. A way to do so is to write an algorithmic procedure. Three classes of nodes can be considered: the community $C$, the "border" $B$ and the rest of the network. Initially, the group $B_0$ is empty and $C_0 = C$. Then, at each step of the algorithm, the following actions are taken:

• Compute $r_i = \sum_{q = k_i^{int}}^{k_i} f(q)$, where the function $f(\cdot)$ is given by Eq. (1). $r_i$ is calculated for each node $i \in C_t$ with respect to the group $C_t$;

• Determine the worst node in $C_t$, $w_{t+1}$, as the vertex with the highest $r_{w_{t+1}}$. Set $B_{t+1} = B_t \cup \{w_{t+1}\}$ and $C_{t+1} = C_t \setminus \{w_{t+1}\}$;

• Compute $\Pr(\le S_{t+1} \mid C_{t+1}, B_{t+1}, r_{w_{t+2}})$, where $S_{t+1} = \sum_{w_i \in B_{t+1}} r_{w_i}$ and $w_{t+2}$ is the worst node still in $C_{t+1}$;

• Increase $t \to t + 1$.

This algorithm explores the interior of the community, always keeping the worst nodes in $B$; it ends when $t = n_C - 1$. $\Pr(\le S_{t+1} \mid C_{t+1}, B_{t+1}, r_{w_{t+2}})$ stands for the probability that the sum of the scores of the worst $t$ nodes of an optimized community in an ensemble of equivalent random graphs is smaller than the one observed for $C$. Its value for a set of independent random variables can be estimated by using Order Statistics (see Appendix A for more details). We then define the B-score as

$$B\text{-score} = \min_t \Pr(\le S_t \mid C_t, B_t, r_{w_{t+1}}), \qquad (5)$$

which corresponds to the lowest value of the probability $\Pr(\le S_t \mid C_t, B_t, r_{w_{t+1}})$ observed during the iterative procedure. We take the minimum as the best approximation for the significance of the group $C$, since it is evaluated for the most favorable split of the nodes of $C$ into border and core. This probability is equivalent to the C-score for $t = 1$, while it becomes a more synergistic quantity as $t$ increases. The inclusion of a longer list of worst nodes in the calculation helps to correct conservative estimates due to under-sampling. When communities are significant with respect to the C-score, they are also significant according to the B-score. Conversely, low values of the B-score do not necessarily correspond to small C-scores. Many concomitant bad nodes with features slightly different from the random expectations may multiply their effect and lead, if there is a real signal, to the prediction of a significant community.
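The iterative procedure and Eq. (5) can be illustrated with a rough Python sketch. It makes two simplifying assumptions that depart from the text: the scores $r$ are computed once and not recomputed as the core shrinks, and the order-statistics probability of Appendix A is replaced by a plain Monte Carlo estimate over uniform draws on $[r_{w_{t+1}}, 1]$. The names `prob_sum_lowest` and `b_score` are hypothetical.

```python
import random

def prob_sum_lowest(s_obs, t, n_draws, low, n_samples=20000, seed=0):
    """Monte Carlo estimate of the probability that the sum of the t
    smallest of n_draws uniform variables on [low, 1] is <= s_obs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        draws = sorted(rng.uniform(low, 1.0) for _ in range(n_draws))
        if sum(draws[:t]) <= s_obs:
            hits += 1
    return hits / n_samples

def b_score(r_worst_first, n_nodes, n_samples=20000, seed=0):
    """Minimum over t of Pr(<= S_t), as in Eq. (5).

    r_worst_first: r-scores of the community nodes, highest (worst) first.
    Assumption: scores are kept fixed while nodes move to the border.
    """
    n_c = len(r_worst_first)
    best = 1.0
    for t in range(1, n_c):
        s_t = sum(r_worst_first[:t])      # summed score of the border B_t
        low = r_worst_first[t]            # score of the worst node left in C_t
        n_draws = n_nodes - n_c + t       # external nodes plus border nodes
        best = min(best, prob_sum_lowest(s_t, t, n_draws, low,
                                         n_samples, seed))
    return best

# Illustrative (assumed) scores for a four-node group in a 30-node graph.
example = b_score([0.9, 0.7, 0.4, 0.1], n_nodes=30)
```

For $t = 1$ the estimate reduces to the probability that the minimum of $N - n_C + 1$ uniform draws does not exceed $r_w$, in line with the statement that the first step is equivalent to the C-score.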

IV. COMPUTATIONAL BENCHMARKS 

Figure 4: (Color online) Cumulative distribution of the C-score for groups of size $n_C = 20$ obtained in random networks of $N = 100$ nodes by modularity maximization. (a) Homogeneous networks with degree $k = 15$. (b) Heterogeneous networks with average degree $\langle k \rangle = 15$. The degree distribution follows a power-law $P(k) \sim k^{-\gamma}$ with $\gamma = 2$. The gray areas delimit the values of the C-score that indicate group statistical significance (probability < 5%).

Figure 5: (Color online) (a) C- and B-scores for communities in benchmarks. The disorder in the connections increases with $\langle k^{ext} \rangle$. The continuous (green) curve corresponds to the target distribution obtained numerically (see text for details). (b) Distribution $P(m_C^{int})$; continuous curves are for the benchmark with the $\langle k^{ext} \rangle$ shown over the curve. The black circles are the numerical distribution measured for an equivalent random graph with the group found by maximizing modularity.

Figure 6: (Color online) C-score (red circles) and B-score (blue squares) calculated for communities in LFR (heterogeneous) benchmarks. The scores are displayed as a function of the mixing parameter, $k^{ext}/k$. The benchmark network size is $N = 1000$, with $\langle k \rangle = 15$, a degree sequence exponent of $\gamma = -2$ and a community size of $n_C = 50$.

As a first test, we applied the C- and the B-scores to groups found in random graphs using clustering techniques. The C-score and the B-score are able to identify these groups as not significant (see Figure 4). The results confirm that the scores are good estimators for the statistics of such groups, further contributing to our confidence in the method. We consider next the performance of the scores on artificial networks with planted community structure. In order to do so, we build networks in the spirit of Girvan and Newman's benchmark [7]. Since our aim is to evaluate a single cluster, the benchmark is composed of a group $C$ with 32 nodes and of another 96 nodes in the rest of the network. Every node in $C$ is connected on average with $\langle k^{int} \rangle$ nodes of its own group and $\langle k^{ext} \rangle$ outside. The external nodes are connected at random. The average total degree for all the nodes is fixed at $\langle k \rangle = 16$. $\langle k^{ext} \rangle$ thus acts as a control parameter for the strength of the community structure. The higher it is, the more prominent the disorder of the connections



becomes. The scores are shown in Fig. 5a as a function of $\langle k^{ext} \rangle$. Both are able to detect the increasing disorder, although, as expected, the C-score is more conservative than the B-score: it rises at lower values of $\langle k^{ext} \rangle$ and so claims earlier that the group could be found in random graphs. The (green) continuous curve in the figure represents a numerical estimate of the ideal function that we want to approximate with the scores. Before explaining how it is obtained, we need to describe the second panel of the figure. The distribution of the internal number of connections of $C$ is displayed in Fig. 5b for the benchmarks at different $\langle k^{ext} \rangle$ as well as for equivalent randomized graphs. The randomized graphs are obtained by reshuffling the connections of the benchmark networks, and the groups of 32 nodes in them are found by modularity maximization. The curves for the benchmarks start far away in the area of high $m_C^{int}$ when $\langle k^{ext} \rangle$ is low. As $\langle k^{ext} \rangle$ increases, they move towards the left and at a certain point, close to $\langle k^{ext} \rangle \approx 8$, cross under the distribution for the randomized graphs. This point marks the end of the significance of the community: similar (or better) groups could be found in a random graph by a clustering algorithm. The continuous curve in Fig. 5a is obtained by simulating this process. For each value of $\langle k^{ext} \rangle$, a set of instances of the benchmark is generated, $m_C^{int}$ is measured for each of them, and the green curve is calculated by averaging the probability of the value $m_C^{int}$ or a higher one (cumulative distribution) in the random graph curve of Fig. 5b. The good agreement of this curve with the B-score shows that, despite all the approximations, the B-score is a good measure of cluster significance.
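A single-cluster benchmark of the kind described above can be sketched with independent random links. The construction below is an assumption made for illustration (the text does not specify how the links are drawn): pairs inside $C$, pairs across the boundary and pairs outside $C$ are connected with probabilities chosen so that the expected degrees match $\langle k^{int} \rangle$, $\langle k^{ext} \rangle$ and $\langle k \rangle = 16$. The function names are hypothetical.

```python
import random

def planted_benchmark(k_ext, k_avg=16.0, n_c=32, n_rest=96, seed=0):
    """Random graph with one planted group C = {0, ..., n_c - 1}.
    Returns an adjacency dict {node: set of neighbours}."""
    rng = random.Random(seed)
    n = n_c + n_rest
    p_in = (k_avg - k_ext) / (n_c - 1)     # pairs inside C
    p_out = k_ext / n_rest                 # pairs between C and the rest
    # An outside node already has k_ext * n_c / n_rest expected links to C.
    p_rest = (k_avg - k_ext * n_c / n_rest) / (n_rest - 1)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if v < n_c:
                p = p_in                   # both endpoints in C
            elif u < n_c:
                p = p_out                  # u in C, v outside
            else:
                p = p_rest                 # both endpoints outside C
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def internal_degree_sum(adj, n_c=32):
    """m_C^int: sum over the nodes of C of their links to other nodes of C."""
    return sum(1 for u in range(n_c) for v in adj[u] if v < n_c)

graph = planted_benchmark(k_ext=4.0)
m_int = internal_degree_sum(graph)
```

Repeating the generation for increasing `k_ext` and recording `m_int` gives the benchmark side of the comparison described in the text; the randomized counterpart would additionally require a modularity-maximization step, which is not sketched here.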

As a final test on benchmarks, we have evaluated the performance of the scores on the benchmark proposed by Lancichinetti et al. (LFR) in Ref. [37]. This technique to generate graphs with planted community structure is a generalization of Girvan and Newman's method to networks with heterogeneous group size and degree distribution. As before, each node has $k^{int}$ connections within its own group and $k^{ext} = k - k^{int}$ edges linking elsewhere. The mixing parameter $k^{ext}/k$ indicates the "strength" of the communities. As can be seen in Figure 6, the scores show a great ability to characterize the modular structure of the benchmark as the mixing parameter increases. Due to the absence of fluctuations, all the communities are well defined until each node shares almost half of its connections with nodes of its group, while the groups become less defined for larger values of the mixing parameter. When about 60% of the links connect with nodes outside the a priori established groups, the communities become equivalent to those found in random graphs.

Figure 7: (Color online) Iterative application of the B-score in order to detect the presence of an internal organization in groups. At each stage, we remove the worst node of the community and compute the B-score for the remaining group. We consider two examples: (a) a well defined cluster with the addition of a few random nodes; and (b) a group composed of the union of two (good) communities.



V. EXPLORING THE INTERIOR OF A 
COMMUNITY 

An interesting application of the scores is the exploration of the internal structure of groups. One could decide to remove the worst node from the community, as we did to measure the B-score, and recompute the scores for the remaining group. The operation can be repeated iteratively as long as there are nodes remaining in the group. Interestingly, this process is able to identify the presence of internal structure in groups of vertices if the original community displays internal modularity. Figure 7 shows two examples of the described operation. The B-score is plotted as a function of the number of removed nodes. We consider two different examples: a well defined cluster (generated with the LFR benchmark) plus some randomly added nodes (Figure 7a); and a group composed of two clusters connected via a few random links (Figure 7b). The iterative procedure is able to detect and set aside the randomly added nodes (Figure 7a), and also to find the deeper internal structure inside the two-element cluster (Figure 7b).

This procedure also allows us to define more detailed measures for the quality of a community. We can search for deeper and deeper cores in the community, which we will call C-$q$ or B-$q$ cores. Given a level of significance $q$, the C-$q$ (or B-$q$) core corresponds to the largest sub-group of a community with C-score (B-score) lower than $q$. In practical applications, a reasonable value of $q$ is 5%. As we will see next, this concept turns out to be a useful tool to characterize communities in real networks. In the case of the benchmarks, the average sizes of the C-$q$ cores obtained for the GN-like networks at $q = 5\%$ are close to 32 up to $\langle k^{ext} \rangle = 8$. At this level of disorder, some nodes stop being significant for the planted communities and are therefore excluded from the $q$-core. For higher disorder levels, the cores shrink further until they eventually vanish.

Table I: Analysis of real networks with known community structure. For each network the table reports, from left to right, the name of the network, the size of its communities $n_C$, the C-score, the B-score, the size of the C-5% core and the size of the B-5% core.
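The $q$-core definition given in this section can be sketched as a simple peeling loop. The sketch assumes that the nodes are supplied already ordered from worst to best (in the text the ranking is re-evaluated after each removal) and that a scoring function is available; `q_core`, `score_fn` and the toy score used in the example are hypothetical.

```python
def q_core(nodes_worst_first, score_fn, q=0.05):
    """Largest remaining sub-group whose score is lower than q.

    nodes_worst_first: community nodes ordered from worst to best.
    score_fn: callable returning the C-score (or B-score) of a node list.
    Returns an empty list if no sub-group reaches the threshold.
    """
    group = list(nodes_worst_first)
    while group:
        if score_fn(group) < q:
            return group
        group = group[1:]        # remove the current worst node and retry
    return []

# Toy example with a stand-in score: the fraction of "noise" nodes in the group.
toy_nodes = ["noise1", "noise2", "a", "b", "c"]
toy_score = lambda g: sum(name.startswith("noise") for name in g) / len(g)
core = q_core(toy_nodes, toy_score, q=0.05)
```

In a real application `score_fn` would wrap the C-score or B-score computation for the given network, and the size of the returned list is the quantity reported as the size of the 5% core.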



VI. EMPIRICAL NETWORKS 

We now show the utility and versatility of our method for the statistical evaluation of communities in real networks. An exhaustive study of the networks with modular structure in the literature has been performed; the following are only a few examples. We report results on social networks such as the Zachary karate club [34] or the one extracted from the characters of the novel Les Misérables [35], and on biological networks such as the C. elegans metabolic network [16]. In two cases, the Zachary club and the college football networks, the structure of the groups is known a priori: in the Zachary club because the network split into two separate groups due to internal dissensions, and in the college football network because the conference in which each team plays is given. It is also important to note that some of these networks, for instance the Zachary club or the C. elegans metabolic network, are weighted graphs for which the weights of the links are equivalent to multiple connections. We have analyzed both the weighted and unweighted versions and report both results in the case of the Zachary club. The evaluation of the groups for the a priori known communities is summarized in Table I, while the results for the communities obtained by maximizing the modularity with a simulated annealing technique are displayed in Table II. There are some general observations valid for all networks. The C-score is often able to discriminate good communities, although sometimes a more sophisticated approach such as the B-score is needed. There are also a few cases in which the B-score reverses the judgment based on the C-score, meaning that a deeper analysis of the communities was required. An example of this type is the Zachary club 2-partition. However, when the original graph with the weight information is considered, its communities become more significant. This seems to apply also to the other weighted graphs, showing that there is a connection between clustering structure and weight location in these networks. We also show the sizes of the 5%-cores of each community in the Tables, as well as a detailed analysis of one of the communities of the C. elegans metabolic network in Fig. 8.



VII. CONCLUSION AND DISCUSSION 

Finding structure in graphs has direct implications for the study of several empirical disciplines as well as for a general understanding of the phenomena behind the evolution of the systems in which such structures arise. Communities are the most direct and easy-to-envisage example of network structures. This concept is a direct heir of the intuitive idea of close groups in social networks. As such, it has a long history, with a good number of algorithms proposed to detect communities in graphs. There are, however, two important issues missing in the literature: a firm mathematical definition of what a community means, and a clear way to determine which of the outputs of the community detection algorithms are really significant.

In this work, we have focused on the second question,
with the hope of giving, even if only partially, a hint of where
the answer to the first one may lie. A new measure able to
statistically quantify the significance of a single community
in networks has been introduced. This measure, called
C-score, represents the probability of occurrence of a group
with the same properties (i.e., same number of nodes,
nodes with the same degree sequence and same internal
Table II: Analysis of the community structure of several real
networks via modularity maximization. For each network the
table reports, from left to right, the name of the network, the
size of its communities $n_c$, the C-score, the B-score, the size
of the C-5%core and the size of the B-5%core. The community
highlighted in Figure 8 is marked as (Green) in the table text.



connections) under the following hypotheses: (i) nodes
in the network are randomly connected; (ii) the group is
chosen, among all possible groups with the same properties,
because it is the one which maximizes the density
of internal connections. The first hypothesis is a natural
assumption: a null model where links are randomly
placed is very often used as a term of comparison
for the determination of correlations or other topological
properties in networks. The second follows
from the common understanding of communities
as groups with high intra-connectivity. Thanks to
the theory of Extreme Statistics, we approximate the values
of the C-score in the case in which our hypotheses
hold. We have tested the performance of the C-score on
several networks, ranging from random graphs to arti- 




Figure 8: (Color online) Community structure for the C. Elegans metabolic network [16] obtained by modularity optimization.
In (a), an overview of the graph partition is shown. In (b), we display a zoom of a single community, depicting in red the nodes
that are not significant group members. In (c), the C-qcore analysis of the community.



ficial networks with controlled community structure, or 
to real networks with unknown internal organization. In 
all cases, we have been able to find good results. The
method's ability to evaluate one community at a time
makes it possible to detect situations in which only some of the
communities of the graph are meaningful while the rest of the
groups are equivalent to random fluctuations. This approach
is also flexible enough to deal with overlapping
groups that share nodes between them, providing a separate
evaluation for each cluster. Two further refinements
of the C-score have also been introduced: one with the
aim of exploring the internal structure of the communities,
the q-core, and another, the B-score, with the intention
of evaluating a community's significance based on a
group of nodes instead of on the worst node of the cluster.
The computational complexity of the evaluation of
the B- and C-scores grows quadratically and linearly with
the community size, respectively. These tools constitute
a set of statistical measures for a thorough evaluation of
single communities, thus avoiding the blind acceptance
of the output of clustering algorithms.
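
As a rough illustration of hypothesis (i), a null model in which links are placed at random while every node keeps its degree, the following Python sketch randomizes a graph by double edge swaps and counts how often a fixed group retains at least its observed number of internal links. This is not the procedure used in this work, which relies on Extreme Statistics rather than sampling, and it deliberately ignores hypothesis (ii), so the resulting fraction is only a naive, more lenient baseline and not the C-score. The function names (rewire, internal_links, naive_group_pvalue) and all parameter values are hypothetical choices made for this example.

```python
import random


def rewire(edges, n_swaps, rng):
    """Degree-preserving randomization of a simple graph by double edge swaps."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    # Cap the number of attempts so tiny graphs cannot loop forever.
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible pairings at random
        # Proposed new links: (a, d) and (c, b).
        if a == d or c == b:
            continue  # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def internal_links(edges, group):
    """Number of links with both endpoints inside the group."""
    group = set(group)
    return sum(1 for u, v in edges if u in group and v in group)


def naive_group_pvalue(edges, group, samples=500, seed=0):
    """Fraction of randomized graphs in which the FIXED group is at least
    as internally connected as in the original graph (assumption: this
    ignores the selection effect described by hypothesis (ii))."""
    rng = random.Random(seed)
    observed = internal_links(edges, group)
    current = list(edges)
    hits = 0
    for _ in range(samples):
        current = rewire(current, 5 * len(edges), rng)
        if internal_links(current, group) >= observed:
            hits += 1
    return hits / samples
```

Because the group is held fixed instead of being selected as the densest one, this estimate tends to overstate significance relative to a measure that accounts for the selection, which is the reason the second hypothesis is needed.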



The software to calculate the C-score
and B-score of communities is available
at http://filrad.homelinux.org/cscore
Appendix A: The distribution $\Pr(< S_t \mid C_t, B_t, r_{w_{t+1}})$

In Section III we have outlined how to compute the
B-score of a community. The iterative procedure makes use
of the probability $\Pr(< S_t \mid C_t, B_t, r_{w_{t+1}})$, which has yet
to be described. During the procedure for the computation
of the B-score, the size of the border is increased by
one at each stage. At step $t$, the border $B_t$ is composed of
the $t$ nodes which, on the basis of their internal degrees,
are least likely to belong to $C_{t-1}$. We have therefore a
sequence of scores, $r_{w_1} > r_{w_2} > \dots > r_{w_t}$, for the $t$ worst
nodes. The score of the worst node still inside $C_t$,
namely $w_{t+1}$, represents a lower bound for the sequence, since
by definition we should have that $r_{w_t} > r_{w_{t+1}}$. Instead
of trying to obtain the probability for the full sequence,
we can simplify our problem and consider the sequence
sum $S_t = \sum_{i=1}^{t} r_{w_i}$. Finding the distribution of $S_t$ can
be formulated as calculating the probability that, given
a sequence of $N - n_{C_t}$ i.i.d. random variables [we indicate
by $n_F$ the size of a set $F$], the sum of the $t$ largest
variables is less than $S_t$. The solution for this problem
can be found in [32, 39]. The cumulative probability
distribution is given by the expression

Pr(< S t \C t ,B t ,r Wt+1 ) = 

i V fl« / lV -+i > y A1 > 



where $\theta_t = \text{Integer-Value}[n_{B_t} + 1 - \xi_t]$ and $\xi_t =
(S_t - n_{B_t} r_{w_{t+1}})/(1 - r_{w_{t+1}})$. Note that Eq. (A1) is valid
under the assumption of independent variables, which is
justifiable to some extent in the case of random networks.
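
The idea behind this distribution can be checked numerically with a short Python sketch. It rests on an assumption that is not spelled out in this appendix: that, under the null model, the node scores are independent and uniformly distributed on [0, 1], so that the t border scores lying above the lower bound are uniform on [lower bound, 1]. Subtracting the lower bound from each score and dividing by (1 - lower bound) then reduces the problem to the sum of t standard uniform variables, whose cumulative distribution is the Irwin-Hall distribution. The code uses the textbook form of that distribution, which may be written differently from Eq. (A1); the function names and the test values are hypothetical.

```python
import math
import random


def irwin_hall_cdf(x, n):
    """CDF at x of the sum of n independent Uniform(0, 1) variables."""
    if x <= 0:
        return 0.0
    if x >= n:
        return 1.0
    total = 0.0
    for k in range(int(math.floor(x)) + 1):
        total += (-1) ** k * math.comb(n, k) * (x - k) ** n
    return total / math.factorial(n)


def border_sum_cdf(s, t, r_lower):
    """Pr(sum of t scores < s), ASSUMING each score is independent and
    Uniform(r_lower, 1); requires r_lower < 1."""
    xi = (s - t * r_lower) / (1.0 - r_lower)  # rescaled sum on [0, t]
    return irwin_hall_cdf(xi, t)


def monte_carlo_cdf(s, t, r_lower, trials=200_000, seed=0):
    """Brute-force estimate of the same probability, used as a cross-check."""
    rng = random.Random(seed)
    hits = sum(
        sum(rng.uniform(r_lower, 1.0) for _ in range(t)) < s
        for _ in range(trials)
    )
    return hits / trials
```

For instance, border_sum_cdf(2.0, 3, 0.2) and monte_carlo_cdf(2.0, 3, 0.2) should agree to within sampling error. The alternating sum suffers from numerical cancellation when t is large, so a production implementation would need a more careful evaluation than this sketch.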